4 research outputs found
NamedMask: Distilling Segmenters from Complementary Foundation Models
The goal of this work is to segment and name regions of images without access
to pixel-level labels during training. To tackle this task, we construct
segmenters by distilling the complementary strengths of two foundation models.
The first, CLIP (Radford et al. 2021), exhibits the ability to assign names to
image content but lacks an accessible representation of object structure. The
second, DINO (Caron et al. 2021), captures the spatial extent of objects but
has no knowledge of object names. Our method, termed NamedMask, begins by using
CLIP to construct category-specific archives of images. These images are
pseudo-labelled with a category-agnostic salient object detector bootstrapped
from DINO, then refined by category-specific segmenters using the CLIP archive
labels. Thanks to the high quality of the refined masks, we show that a
standard segmentation architecture trained on these archives with appropriate
data augmentation achieves impressive semantic segmentation abilities for both
single-object and multi-object images. As a result, our proposed NamedMask
performs favourably against a range of prior work on five benchmarks including
the VOC2012, COCO and large-scale ImageNet-S datasets.
Comment: Tech report. Code: https://github.com/NoelShin/namedmas
ReCo: Retrieve and Co-segment for Zero-shot Transfer
Semantic segmentation has a broad range of applications, but its real-world
impact has been significantly limited by the prohibitive annotation costs
necessary to enable deployment. Segmentation methods that forgo supervision can
side-step these costs, but exhibit the inconvenient requirement to provide
labelled examples from the target distribution to assign concept names to
predictions. An alternative line of work in language-image pre-training has
recently demonstrated the potential to produce models that can both assign
names across large vocabularies of concepts and enable zero-shot transfer for
classification, but do not demonstrate commensurate segmentation abilities. In
this work, we strive to achieve a synthesis of these two approaches that
combines their strengths. We leverage the retrieval abilities of one such
language-image pre-trained model, CLIP, to dynamically curate training sets
from unlabelled images for arbitrary collections of concept names, and leverage
the robust correspondences offered by modern image representations to
co-segment entities among the resulting collections. The synthetic segment
collections are then employed to construct a segmentation model (without
requiring pixel labels) whose knowledge of concepts is inherited from the
scalable pre-training process of CLIP. We demonstrate that our approach, termed
Retrieve and Co-segment (ReCo), performs favourably against unsupervised segmentation
approaches while inheriting the convenience of nameable predictions and
zero-shot transfer. We also demonstrate ReCo's ability to generate specialist
segmenters for extremely rare objects.
Comment: Tech report. Code: https://github.com/NoelShin/rec
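The retrieval step described in the ReCo abstract (curating a training set for a concept name from unlabelled images) can be illustrated roughly as follows. This is a minimal sketch, not the authors' implementation: it assumes that a text embedding for the concept and one embedding per image are already available, and it uses toy 2-D vectors in place of real CLIP outputs. The names `cosine`, `build_archive` and the parameter `k` are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def build_archive(text_emb, image_embs, k):
    """Return the names of the k images most similar to the concept embedding.

    text_emb   -- embedding of the concept name (assumed to come from a text encoder)
    image_embs -- dict mapping image name -> embedding (assumed to come from an image encoder)
    """
    ranked = sorted(
        image_embs.items(),
        key=lambda item: cosine(text_emb, item[1]),
        reverse=True,
    )
    return [name for name, _ in ranked[:k]]

# Toy stand-ins for real embeddings (assumption: real ones would be high-dimensional).
text_emb = [1.0, 0.0]
image_embs = {
    "a.jpg": [0.9, 0.1],
    "b.jpg": [0.0, 1.0],
    "c.jpg": [0.7, 0.7],
}
archive = build_archive(text_emb, image_embs, k=2)
print(archive)
```

In this toy setting the two images whose embeddings point most nearly in the direction of the concept embedding are kept; the co-segmentation stage mentioned in the abstract would then operate on such a collection.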
Nighttime Reflectance Generation in the Visible Band of Satellites
Visible (VIS) bands, such as the 0.675 μm band in geostationary satellite remote sensing, have played an important role in monitoring and analyzing weather and climate change during the past few decades with coarse spatial and high temporal resolution. Recently, many deep learning techniques have been developed and applied in a variety of applications and research fields. In this study, we developed a deep-learning-based model to generate non-existent nighttime VIS satellite images using the Conditional Generative Adversarial Nets (CGAN) technique. For our CGAN-based model training and validation, we used the daytime image data sets of reflectance in the Communication, Ocean and Meteorological Satellite / Meteorological Imager (COMS/MI) VIS (0.675 μm) band and radiance in the longwave infrared (10.8 μm) band of the COMS/MI sensor over five years (2012 to 2017). Our results show high accuracy (bias = −2.41 and root mean square error (RMSE) = 36.85 during summer, bias = −0.21 and RMSE = 33.02 during winter) and correlation (correlation coefficient (CC) = 0.88 during summer, CC = 0.89 during winter) of values between the observed images and the CGAN-generated images for the COMS VIS band. Consequently, our CGAN-based model can be effectively used in a variety of meteorological applications, such as cloud, fog, and typhoon analyses during daytime and nighttime.
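The three scores quoted in this abstract (bias, RMSE and CC) can be computed as in the sketch below. It is an illustration under stated assumptions rather than the study's evaluation code: bias is taken here as the mean of generated minus observed values (the abstract does not state the sign convention), CC is the Pearson correlation coefficient, and the pixel values are made-up numbers. The function names are hypothetical.

```python
import math

def bias(observed, generated):
    """Mean difference, generated minus observed (sign convention is an assumption)."""
    return sum(g - o for o, g in zip(observed, generated)) / len(observed)

def rmse(observed, generated):
    """Root mean square error between the two series."""
    return math.sqrt(
        sum((g - o) ** 2 for o, g in zip(observed, generated)) / len(observed)
    )

def pearson_cc(observed, generated):
    """Pearson correlation coefficient between the two series."""
    n = len(observed)
    mean_o = sum(observed) / n
    mean_g = sum(generated) / n
    cov = sum((o - mean_o) * (g - mean_g) for o, g in zip(observed, generated))
    var_o = sum((o - mean_o) ** 2 for o in observed)
    var_g = sum((g - mean_g) ** 2 for g in generated)
    return cov / math.sqrt(var_o * var_g)

# Made-up pixel values standing in for observed and generated VIS images.
observed = [10.0, 20.0, 30.0, 40.0]
generated = [12.0, 18.0, 33.0, 41.0]

b = bias(observed, generated)
r = rmse(observed, generated)
cc = pearson_cc(observed, generated)
print(b, r, cc)
```

For real imagery, the two lists would be the flattened pixel values of an observed daytime image and the corresponding generated image.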